Search for Samples or Studies

R
Authors and affiliations:

  • Sandy Rogers, MGnify team at EMBL-EBI
  • Ben Allen, Newcastle University

This is a static preview

You can run and edit these examples interactively on Galaxy

Search for MGnify Studies or Samples, using MGnifyR

The MGnify API returns data and relationships as JSON. MGnifyR is a package to help you read MGnify data into your R analyses.

This example shows you how to search MGnify Studies or Samples.

You can find all of the other “API endpoints” using the Browsable API interface in your web browser. This interface also lets you inspect the kinds of Filters that can be created for each list.

This is an interactive code notebook (a Jupyter Notebook). To run this code, click into each cell and press the ▶ button in the top toolbar, or press shift+enter.


library(IRdisplay)
display_markdown(file = '../_resources/mgnifyr_help.md')

Help with MGnifyR

MGnifyR is an R package that provides a convenient way for R users to access data from the MGnify API.

Detailed help for each function is available in R using the standard ?function_name command (e.g. typing ?mgnify_query will bring up the built-in help for the mgnify_query command).

A vignette gives a fairly detailed overview of the main functionality. You can read it either within R by running vignette("MGnifyR"), or in the development repository.

MGnifyR Command cheat sheet

The following list of key functions is a starting point for finding the relevant documentation.

  • mgnify_client() : Create the client object required for all other functions.
  • mgnify_query() : Search the whole MGnify database.
  • mgnify_analyses_from_xxx() : Convert xxx accessions to analysis accessions, where xxx is either samples or studies.
  • mgnify_get_analyses_metadata() : Retrieve all study, sample and analysis metadata for given analyses.
  • mgnify_get_analyses_phyloseq() : Convert abundance, taxonomic, and sample metadata into a single phyloseq object.
  • mgnify_get_analyses_results() : Get functional annotation results for a set of analyses.
  • mgnify_download() : Download raw results files from MGnify.
  • mgnify_retrieve_json() : Low level API access helper function.

Load packages:

library(dplyr)
library(vegan)
library(ggplot2)
library(phyloseq)
library(MGnifyR)

mg <- mgnify_client(usecache = TRUE, cache_dir = '/tmp/mgnify_cache')

Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: permute

Loading required package: lattice

This is vegan 2.6-4

Contents

Documentation for mgnify_query

?mgnify_query

Example: find Polar samples

In these examples we set maxhits=1 to retrieve only the first page of results. You can change the limit or set it to -1 to retrieve all samples matching the query.

samps_np <- mgnify_query(mg, "samples", latitude_gte=88, maxhits=1)
samps_sp <- mgnify_query(mg, "samples", latitude_lte=-88, maxhits=1)
samps_polar <- bind_rows(samps_np, samps_sp)
head(samps_polar)
A data.frame: 6 × 53
| | latitude | longitude | biosample | accession | analysis-completed | collection-date | geo-loc-name | sample-desc | environment-biome | environment-feature | size fraction lower threshold | size fraction upper threshold | temperature | salinity | target gene | host-tax-id | altitude | host common name | host taxid | host scientific name |
| | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| ERS1972786 | 88.8268 | 58.6275 | SAMEA104347808 | ERS1972786 | 2018-06-29 | 2012-09-22 | Arctic Ocean | deposited algae aggregate from the deep-sea floor | marine abyssal zone biome [ENVO:01000027]|polar biome [ENVO_01000339] | abyssal plain [ENVO:00000244] | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| ERS1972795 | 88.8145 | 57.7384 | SAMEA104347817 | ERS1972795 | 2018-06-29 | 2012-09-22 | Central Arctic Ocean/Eurasian basin | deposited algae aggregate from the deep-sea floor | marine abyssal zone biome [ENVO:01000027]|polar biome [ENVO_01000339] | abyssal plain [ENVO:00000244] | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| ERS1972765 | 88.8277 | 58.8635 | SAMEA104347787 | ERS1972765 | 2018-06-29 | 2012-09-22 | Arctic Ocean | upper half of a sea-ice core | oceanic epipelagic zone biome [ENVO:01000035]|marine pelagic biome [ENVO:01000023]|polar biome [ENVO_01000339] | sea ice floe [ENVO:03000066] | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| ERS1972794 | 88.8268 | 58.6275 | SAMEA104347816 | ERS1972794 | 2018-06-29 | 2012-09-22 | Arctic Ocean | deposited algae aggregate from the deep-sea floor | marine abyssal zone biome [ENVO:01000027]|polar biome [ENVO_01000339] | abyssal plain [ENVO:00000244] | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| ERS1972745 | 88.8277 | 58.8635 | SAMEA104347767 | ERS1972745 | 2018-06-29 | 2012-09-22 | Central Arctic Ocean/Eurasian basin | brownish colored sea ice | oceanic epipelagic zone biome [ENVO:01000035]|polar biome [ENVO_01000339] | sea ice floe [ENVO:03000066] | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| ERS1972756 | 88.8277 | 58.8635 | SAMEA104347778 | ERS1972756 | 2018-06-29 | 2012-09-22 | Arctic Ocean | lower half of a sea-ice core | oceanic epipelagic zone biome [ENVO:01000035]|marine pelagic biome [ENVO:01000023]|polar biome [ENVO_01000339] | sea ice floe [ENVO:03000066] | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
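The <chr> type row in the output shows that every column, including latitude and longitude, comes back as character data, so numeric comparisons need a conversion step. The following is a minimal sketch of that step on a small hand-built data frame that copies three rows from the output; the names polar_demo and northernmost are illustrative and not part of MGnifyR.

```r
# Hand-built stand-in for three rows of samps_polar; values copied from the output.
polar_demo <- data.frame(
  accession = c("ERS1972786", "ERS1972795", "ERS1972765"),
  latitude  = c("88.8268", "88.8145", "88.8277"),
  longitude = c("58.6275", "57.7384", "58.8635"),
  stringsAsFactors = FALSE
)

# Comparing character values would compare them alphabetically,
# so convert the coordinates to numeric first.
polar_demo$latitude  <- as.numeric(polar_demo$latitude)
polar_demo$longitude <- as.numeric(polar_demo$longitude)

# Keep only samples at or above 88.82 degrees north.
northernmost <- polar_demo[polar_demo$latitude >= 88.82, ]
print(northernmost$accession)
```

The same pattern should apply to the real samps_polar data frame, provided its latitude column contains only values that parse as numbers.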

Example: find Wastewater studies

studies_ww <- mgnify_query(mg, "studies", biome_name="wastewater", maxhits=1)
head(studies_ww)
A data.frame: 6 × 12
| | samples-count | accession | bioproject | is-private | last-update | secondary-accession | centre-name | study-abstract | study-name | data-origination | acc_type | type |
| | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| MGYS00006570 | 81 | MGYS00006570 | PRJEB71375 | FALSE | 2024-01-23T18:35:35 | ERP156179 | EMG | The Third Party Annotation (TPA) assembly was derived from the primary whole genome shotgun (WGS) data set PRJNA230567, and was assembled with metaspades v3.15.3. This project includes samples from the following biomes: root:Engineered:Wastewater. | EMG produced TPA metagenomics assembly of PRJNA230567 data set (Systems Biology of Lipid Accumulating Organisms). | SUBMITTED | studies | studies |
| MGYS00006558 | 85 | MGYS00006558 | PRJNA230567 | FALSE | 2023-12-19T12:35:08 | SRP033648 | Luxembourg Centre for Systems Biomedicine | Characterization of microbial communities at the genomic, transcriptomic, proteomic and metabolomic levels, with a special interest on lipid accumulating bacterial populations, which are naturally enriched in biological wastewater treatment systems and may be harnessed for the conversion of mixed lipid substrates (wastewater) into biodiesel. The project aims to elucidate the genetic blueprints and the functional relevance of specific populations within the community. It focuses on within-population genetic and functional heterogeneity, trying to understand how fine-scale variations contribute to differing lipid accumulating phenotypes. Insights from this project will contribute to the understanding the functioning of microbial ecosystems, and improve optimization and modeling strategies for current and future biological wastewater treatment processes. This BioProject contains datasets derived from the same biological wastewater treatment plant. The date includes metagenomes, metatranscriptomes and organisms isolated in pure cultures. | Systems Biology of Lipid Accumulating Organisms | HARVESTED | studies | studies |
| MGYS00005985 | 1 | MGYS00005985 | PRJEB45225 | FALSE | 2022-03-11T21:49:39 | ERP129301 | EMG | The Third Party Annotation (TPA) assembly was derived from the primary whole genome shotgun (WGS) data set PRJNA593593, and was assembled with metaSPAdes v3.15.2. This project includes samples from the following biomes: root:Engineered:Wastewater. | EMG produced TPA metagenomics assembly of PRJNA593593 data set (Sewage microbial communities from Oakland, California, United States - Biofuel Metagenome 10). | SUBMITTED | studies | studies |
| MGYS00005997 | 1 | MGYS00005997 | PRJEB45727 | FALSE | 2022-03-11T21:12:02 | ERP129875 | EMG | The Third Party Annotation (TPA) assembly was derived from the primary whole genome shotgun (WGS) data set PRJNA593594, and was assembled with metaSPAdes v3.15.2. This project includes samples from the following biomes: root:Engineered:Wastewater. | EMG produced TPA metagenomics assembly of PRJNA593594 data set (Sewage microbial communities from Oakland, California, United States - Biofuel Metagenome 11). | SUBMITTED | studies | studies |
| MGYS00005986 | 1 | MGYS00005986 | PRJNA593593 | FALSE | 2022-02-28T14:04:08 | SRP270050 | DOE Joint Genome Institute | Sewage-derived enrichment culture (anaerobic medium, 0.1% glucose), planktonic phase | Sewage microbial communities from Oakland, California, United States - Biofuel Metagenome 10 | HARVESTED | studies | studies |
| MGYS00002316 | 1 | MGYS00002316 | PRJEB24109 | FALSE | 2022-02-03T15:58:54 | ERP105914 | EMBL-EBI | The activated sludge metagenome Third Party Annotation (TPA) assembly was derived from the primary whole genome shotgun (WGS) data set: PRJNA340752. This project includes samples from the following biomes: Engineered, Wastewater, Activated Sludge. | EMG produced TPA metagenomics assembly of the Active sludge microbial communities of municipal wastewater-treating anaerobic digesters from China - AD_SCU002_MetaG metagenome (activated sludge metagenome) data set. | SUBMITTED | studies | studies |
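Several of the returned column names contain hyphens (samples-count, data-origination), which are not valid R identifiers, and the counts arrive as character data. The sketch below shows one way to handle both, using a hand-built data frame that copies three rows from the output; studies_demo and large are illustrative names, not part of MGnifyR.

```r
# Hand-built stand-in for three rows of studies_ww; values copied from the output.
# check.names = FALSE keeps the hyphenated column names unchanged.
studies_demo <- data.frame(
  accession = c("MGYS00006570", "MGYS00006558", "MGYS00005985"),
  `samples-count` = c("81", "85", "1"),
  `data-origination` = c("SUBMITTED", "HARVESTED", "SUBMITTED"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hyphenated names need [[ ]] (or backticks with $) to be addressed.
studies_demo[["samples-count"]] <- as.integer(studies_demo[["samples-count"]])

# Keep studies with more than 10 samples, largest first.
large <- studies_demo[studies_demo[["samples-count"]] > 10, ]
large <- large[order(large[["samples-count"]], decreasing = TRUE), ]
print(large$accession)
```

With dplyr loaded, the equivalent filter would use backticks around the column name, for example filter(studies_ww, as.integer(`samples-count`) > 10).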

More filters to try:

Samples by location

more_northerly_than <- mgnify_query(mg, "samples", latitude_gte=88, maxhits=1)

more_southerly_than <- mgnify_query(mg, "samples", latitude_lte=-88, maxhits=1)

more_easterly_than <- mgnify_query(mg, "samples", longitude_gte=170, maxhits=1)

more_westerly_than <- mgnify_query(mg, "samples", longitude_lte=170, maxhits=1)

at_location <- mgnify_query(mg, "samples", geo_loc_name="usa", maxhits=1)
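
The location filters above can in principle be combined in one query to describe a bounding box. The following is a minimal sketch under that assumption: the source does not show all four location filters used together, so check the behaviour against the API before relying on it. bbox_filters is a hypothetical helper, not part of MGnifyR, and the box limits are arbitrary illustration values.

```r
# Hypothetical helper (not part of MGnifyR): collect bounding-box limits
# under the filter names used in the queries above.
bbox_filters <- function(lat_min, lat_max, lon_min, lon_max) {
  stopifnot(lat_min <= lat_max, lon_min <= lon_max)
  list(
    latitude_gte  = lat_min,
    latitude_lte  = lat_max,
    longitude_gte = lon_min,
    longitude_lte = lon_max
  )
}

# Arbitrary example box, chosen only for illustration.
arctic_box <- bbox_filters(lat_min = 80, lat_max = 90, lon_min = 0, lon_max = 60)
str(arctic_box)

# With a client `mg` available, the list could be spliced into a query
# (assumption: the API applies all four filters together):
# in_box <- do.call(mgnify_query, c(list(mg, "samples", maxhits = 1), arctic_box))
```

Keeping the filters in a named list makes it easy to reuse the same box for several queries.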

Samples by biome

biome_within_wastewater <- mgnify_query(mg, "samples", biome_name="wastewater", maxhits=1)

Samples by metadata

There is a large number of metadata key:value pairs because authors submit them, along with the samples, to the European Nucleotide Archive (ENA).

If you know how to specify the metadata key:value query for the samples you’re interested in, you can use this form to find matching Samples:

from_ex_smokers <- mgnify_query(mg, "samples", metadata_key="smoker", metadata_value="ex-smoker", maxhits=-1)

To find metadata_keys and values, it is best to browse the interactive API Browser, and use the Filters button to construct queries interactively at first.
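
When exploring filters in the browser, it can help to see how a query's named filters might map onto a URL. The sketch below builds such a URL as plain text and does not contact the API. Both the base URL and the idea that filter names become query-string parameters unchanged are assumptions to confirm in the API Browser; filter_url is a hypothetical helper, not part of MGnifyR.

```r
# Assumed base URL of the MGnify API; confirm it in your browser before relying on it.
base_url <- "https://www.ebi.ac.uk/metagenomics/api/v1"

# Hypothetical helper (not part of MGnifyR): build a list-endpoint URL
# from one or more named filters.
filter_url <- function(endpoint, ...) {
  filters <- list(...)
  encoded <- vapply(
    filters,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  pairs <- paste0(names(filters), "=", encoded)
  paste0(base_url, "/", endpoint, "?", paste(pairs, collapse = "&"))
}

smoker_url <- filter_url("samples", metadata_key = "smoker", metadata_value = "ex-smoker")
print(smoker_url)
```

Opening the printed URL in a browser is one way to check a filter combination before passing the same names to mgnify_query.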

Studies by centre name

from_smithsonian <- mgnify_query(mg, "studies", centre_name="Smithsonian", maxhits=-1)



Example: applying further filters to the data frame

First, fetch some samples from the Lentic biome. We can specify the entire Biome lineage, too.

lentic_samples <- mgnify_query(mg, "samples", biome_name="root:Environmental:Aquatic:Lentic", usecache=TRUE)

Now, also filter by depth within the returned results, using normal R syntax.

depth_numeric <- as.numeric(lentic_samples$depth)  # We must convert data from MGnifyR (always strings) to numerical format.
depth_numeric[is.na(depth_numeric)] <- 0.0  # If depth data is missing, assume it is surface-level.
lentic_subset <- lentic_samples[depth_numeric >= 25 & depth_numeric <= 50, ]  # Filter to samples collected between 25 m and 50 m down.
lentic_subset
A data.frame: 16 × 37
| | biosample | latitude | longitude | accession | collection-date | sample-desc | sample-name | sample-alias | last-update | geographic location (longitude) | instrument model | last update date | investigation type | project name | geographic location (depth) | geographic location (altitude) | environmental package | sequencing method | NCBI sample classification | ENA checklist |
| | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| SRS992699 | SAMN03860260 | 17.39 | 40.54 | SRS992699 | 2011-10-15 | 12 | sample03 | sample03 | 2020-05-18T00:52:00 | 40.54 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992702 | SAMN03860274 | 20.31 | 38.46 | SRS992702 | 2011-10-15 | 91 | sample17 | sample17 | 2020-05-18T00:51:47 | 38.46 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992693 | SAMN03860259 | 17.39 | 40.54 | SRS992693 | 2011-10-15 | 12 | sample02 | sample02 | 2020-05-18T00:50:43 | 40.54 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992705 | SAMN03860286 | 23.36 | 37.3 | SRS992705 | 2011-10-15 | 149 | sample29 | sample29 | 2020-05-18T00:46:05 | 37.3 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992692 | SAMN03860268 | 18.34 | 40.44 | SRS992692 | 2011-10-15 | 34 | sample11 | sample11 | 2020-05-18T00:45:26 | 40.44 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992710 | SAMN03860281 | 22.2 | 37.55 | SRS992710 | 2011-10-15 | 108 | sample24 | sample24 | 2020-05-18T00:35:28 | 37.55 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992714 | SAMN03860292 | 25.46 | 36.6 | SRS992714 | 2011-10-15 | 169 | sample35 | sample35 | 2020-05-18T00:35:15 | 36.6 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992704 | SAMN03860287 | 23.36 | 37.3 | SRS992704 | 2011-10-15 | 149 | sample30 | sample30 | 2020-05-18T00:27:10 | 37.3 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992713 | SAMN03860293 | 25.46 | 36.6 | SRS992713 | 2011-10-15 | 169 | sample36 | sample36 | 2020-05-18T00:13:07 | 36.6 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992696 | SAMN03860280 | 22.2 | 37.55 | SRS992696 | 2011-10-15 | 108 | sample23 | sample23 | 2020-05-18T00:06:42 | 37.55 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992729 | SAMN03860263 | 17.59 | 39.47 | SRS992729 | 2011-10-15 | 22 | sample06 | sample06 | 2020-05-17T10:16:43 | 39.47 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992701 | SAMN03860275 | 20.31 | 38.46 | SRS992701 | 2011-10-15 | 91 | sample18 | sample18 | 2020-05-17T10:11:28 | 38.46 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992691 | SAMN03860269 | 18.34 | 40.44 | SRS992691 | 2011-10-15 | 34 | sample12 | sample12 | 2020-05-17T10:01:44 | 40.44 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992721 | SAMN03860262 | 17.59 | 39.47 | SRS992721 | 2011-10-15 | 22 | sample05 | sample05 | 2020-05-17T10:01:21 | 39.47 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992720 | SAMN03860299 | 27.53 | 34.3 | SRS992720 | 2011-10-15 | 192 | sample42 | sample42 | 2020-05-17T10:00:58 | 34.3 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992719 | SAMN03860298 | 27.53 | 34.3 | SRS992719 | 2011-10-15 | 192 | sample41 | sample41 | 2020-05-17T10:00:45 | 34.3 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
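
The depth filter in this example treats a missing depth as surface-level. An alternative is to exclude samples without a usable depth altogether. The following is a minimal sketch of that variant on a hand-built data frame; the accessions and depth values are invented for illustration and do not come from MGnify.

```r
# Hand-built stand-in for lentic_samples; accessions and depths are invented.
lentic_demo <- data.frame(
  accession = c("DEMO1", "DEMO2", "DEMO3", "DEMO4"),
  depth     = c("30", NA, "5", "not recorded"),
  stringsAsFactors = FALSE
)

# Strings that are not numbers become NA; suppressWarnings() hides the coercion warning.
depth_numeric <- suppressWarnings(as.numeric(lentic_demo$depth))

# which() discards NA comparisons, so samples without a usable depth are
# excluded instead of being treated as surface-level (0 m).
in_range <- which(depth_numeric >= 25 & depth_numeric <= 50)
lentic_demo_subset <- lentic_demo[in_range, ]
print(lentic_demo_subset$accession)
```

For a 25 m to 50 m window both approaches select the same rows, because a substituted depth of 0 falls outside the window. They differ once the window includes 0.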